
# purrr
In my opinion, purrr is one of the most underrated and under-utilized R packages. It has completely revolutionized my own efficiency and workspace organization, particularly as someone who works with super messy data that comes in a variety of forms.

In this tutorial, we are going to cover a number of what I believe are the most functional and important applications of purrr in psychological research. Given the audience, in the first half of the tutorial, I will focus on working with the diverse forms of data that many of you work with, providing examples of how to load, clean, and merge data using purrr. In the second half, I will focus on how we can use purrr with longitudinal data analysis when we are working with multiple predictors and outcomes.

Background: Iteration

Before we get there, though, I think it’s useful to think about when and where we would use purrr.

Iteration is everywhere. It underpins much of mathematics and statistics. If you’ve ever seen the \(\Sigma\) symbol, then you’ve seen (and probably used) iteration.

It’s also incredibly useful. Anytime you have to repeat some sort of action many times, iteration is your best friend. In psychology, this often means reading in a bunch of individual data files from an experiment, repeating an analysis with a series of different predictors or outcomes, or creating a series of figures.

Enter for loops. for loops are the “OG” form of iteration in computer science. The basic syntax is below. Basically, we can use a for loop to loop through and print a series of things.
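A minimal sketch of such a loop, iterating over the first five letters:

```r
# Loop over a character vector and print each element
for (letter in letters[1:5]) {
  print(letter)
}
```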

## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"

The code above loops through the first five letters, printing each one in turn.

Essentially, like the apply(), lapply(), sapply(), and mapply() family of functions, purrr is meant as an alternative to explicit for loops in R. for loops are great, but they aren’t as great in R as they are in some other programming languages. In R, you’re often better off vectorizing or relying on functions with C++ backends.

There are a lot of functions in the purrr package that I encourage you to check out. Today, though, we’ll focus on the map() family of functions. The breakdown of map functions is pretty intuitive. The basic map function wants two things as input – a list or vector and a function. So the purrr equivalent of the example above would be:
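A sketch of the equivalent map() call. map() applies the function (here, print()) to each element of the vector, so each letter is printed and the results are collected:

```r
library(purrr)

# Apply print() to each of the first five letters
map(letters[1:5], print)
```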

## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [[1]]
## [1] "a"
## 
## [[2]]
## [1] "b"
## 
## [[3]]
## [1] "c"
## 
## [[4]]
## [1] "d"
## 
## [[5]]
## [1] "e"

Note that this returns a list, which we may not always want. With purrr, we can change the kind of output map() returns by adding a suffix, like _lgl, _dbl, _chr, or _df. So in the example above, we may have wanted just a character vector. To do that we’d call map_chr():
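The call looks like this (print() invisibly returns its input, which is why map_chr() can collect the letters into a vector):

```r
# Same as before, but collect the results into a character vector
map_chr(letters[1:5], print)
```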

## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [1] "a" "b" "c" "d" "e"

Note that it also returns the concatenated character vector as well as printing each letter individually (i.e. iteratively).

map() functions can also handle multiple inputs. Often we need to feed multiple pieces of information to a function, similar to how we work with nested for loops. For these cases, we have map2() and pmap(), which take additional arguments. map2(), shockingly, takes two inputs, and pmap() takes p inputs that you feed in as a list (e.g. pmap(list(a, b, c, d), my_fun)). A simple printing example would be:
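A sketch using the _chr suffix so we get a character vector back:

```r
# Paste each letter together with its position, elementwise
map2_chr(letters[1:5], 1:5, paste)
```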

## [1] "a 1" "b 2" "c 3" "d 4" "e 5"

Note here that we can use map2() and pmap() with the suffixes from above.

This likely makes little sense at this point, and that’s fine. The examples in the rest of this tutorial should elucidate their usage. The last note I’ll make is that thinking about the structure of your data is going to be very important when using purrr. To use it effectively, you’ll need your data in specific forms, which will often require data manipulations. It just takes practice.

Data Reading

To demonstrate the first case in which I find purrr useful, we are going to consider five cases that, in my experience, capture many of the challenges we often face when working with psychological data. In each of these cases, we will use a codebook of the form we discussed in the previous tutorial on codebooks.

All of these share a similar feature: multiple files. There are a variety of other techniques you could use to get your data into a usable form, such as those below:

  1. Writing code to load in each file separately (not good).
  2. Copying each data file into one larger data set in Excel (worse).

But let’s not do that. Let’s use iteration to make our process efficient and transparent.

Many Subjects, Same Variables (Example 1)

We will start with a data storage format that is very common in experimental studies in various fields of psychology as well as in observational studies of repeated assessments of individuals (i.e. ESM, EMA, etc.).

For this first example, I’ll show you how this would look with a for loop before I show you how it looks with purrr.

Assuming you have all the data in a single folder and the format is reasonably similar, you have the following basic syntax:
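A minimal sketch of that for loop, assuming one .csv file per subject in a folder called "data" (the path and file layout are assumptions):

```r
library(readr)

# All per-subject files in one folder (hypothetical path)
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

df <- data.frame()
for (f in files) {
  tmp <- read_csv(f)                          # read one subject's file
  tmp$ID <- sub("\\.csv$", "", basename(f))   # subject ID from the file name
  df <- rbind(df, tmp)                        # stack onto the running data frame
}
```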

This works fine in this simple case, but where purrr really shines is when you need to make modifications to your data before combining them, whether that means recoding, removing missing cases, or renaming variables.

But first, the simple case of reading data.
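A sketch of the purrr version, again assuming a hypothetical "data" folder of per-subject .csv files:

```r
library(tidyverse)

data_path <- "data"  # hypothetical folder of one .csv per subject

df <- tibble(ID = list.files(data_path, pattern = "\\.csv$")) %>%
  mutate(data = map(ID, ~read_csv(file.path(data_path, .x)))) %>%  # read each file into a list column
  mutate(ID = str_remove(ID, "\\.csv")) %>%                        # drop ".csv" from the ID
  unnest(data)                                                     # stack everything into one data frame
```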

The code above creates a list of IDs from the data path (files are named for each person), reads the data in using the map() function from purrr, removes the “.csv” from the ID variable, then unnests the data, resulting in one long data frame with each person’s rows stacked and indexed by ID.

But often, we have variable names that aren’t super informative, so we want to rename them. In this case, we need to use our codebook to give them more informative variable names.

In this case, where all people have the same variables, it’s easiest to just rename them after unnesting, so the full code would look like this:
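A sketch of the full pipeline; the codebook columns `old_name` and `new_name` are assumptions about how your codebook is laid out:

```r
library(tidyverse)

data_path <- "data"                   # hypothetical folder
codebook  <- read_csv("codebook.csv") # hypothetical codebook with old_name / new_name columns

df <- tibble(ID = list.files(data_path, pattern = "\\.csv$")) %>%
  mutate(data = map(ID, ~read_csv(file.path(data_path, .x))),
         ID = str_remove(ID, "\\.csv")) %>%
  unnest(data) %>%
  # rename every variable that appears in the codebook to its informative name
  rename_with(~ codebook$new_name[match(.x, codebook$old_name)],
              .cols = any_of(codebook$old_name))
```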

Many Subjects, Different Variables (Example 2)

In some cases, participants may have different variables. This could be due to a skip rule in a study or intentionally different variable collection (e.g. in between-person experiments or idiographic work like I do). In this case, we might need to filter or rename variables within our iterative loop.

In this case, all participants have the same set of core variables but were randomly assigned to complete one additional scale.
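A sketch of how that might look; the helper `keep_vars()` and the codebook column `old_name` are hypothetical names:

```r
library(tidyverse)

# Keep only variables listed in the codebook; each subject will match the core
# items plus whichever randomly assigned scale they completed
keep_vars <- function(file) {
  read_csv(file) %>%
    select(any_of(codebook$old_name))
}

df <- tibble(ID = list.files(data_path, pattern = "\\.csv$")) %>%
  mutate(data = map(ID, ~keep_vars(file.path(data_path, .x))),
         ID = str_remove(ID, "\\.csv")) %>%
  unnest(data)  # columns a subject lacks are filled with NA when stacking
```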

Multiple Waves, Same Variables (Example 3)

In some cases, instead of multiple files for each participant, we collect a single file for all participants across different waves (e.g. using Qualtrics). In this case, we need to index the files a little differently. Instead of reading in files for participants, we need to read in files for waves, which may be named in a variety of ways.

Here, I’ll start with a simple example of data that were well-managed, with files named identically except for the wave information. This is good practice. In general, I’m against modifying data, but I am a fan of changing file names because it helps with data management and prevents the need to go in and modify information within the files themselves.

These data come from a longitudinal study of personality. We have seven waves, and the variable names for all items are consistent across waves. In this case, our code is almost identical to reading in multiple files for each participant, except that now we have wave info and will need to toss out part of the file names at the end.
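A sketch, assuming the wave files sit in a hypothetical "data/waves" folder with names like "wave1.csv":

```r
library(tidyverse)

wave_path <- "data/waves"  # hypothetical folder of one .csv per wave

df <- tibble(wave = list.files(wave_path, pattern = "\\.csv$")) %>%
  mutate(data = map(wave, ~read_csv(file.path(wave_path, .x)))) %>%
  unnest(data) %>%
  # keep only the numeric wave value (assumes one numeric run per file name)
  mutate(wave = unlist(str_extract_all(wave, "[0-9]+")))
```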

The only change from the code for reading in multiple files for participants is that we have “wave” as a variable instead of “ID” and we use the str_extract_all() function from the stringr package (part of tidyverse) to get rid of everything except the numeric wave value.

Multiple Waves, Different Variables (Example 4)

Oftentimes, however, we do not have the same variables across waves, or the variables do not have the same names across waves. In those cases, we’ll have to do a little extra work to get our data into a form where we can unnest() them – that is, a form where shared variables actually share column names.

We’ll start with the case where we have some additional information (e.g. demographics) in the first wave.

These data are the same as we used in the previous example except that I changed the names and added demographic information for this example. This means that we have slightly different information in wave one and need a way to match the same variables across waves. We’ll use our codebook to achieve this with little issue!

However, because of this, we’ll need to use a function that takes the year as input, so that we pull the correct variables from the codebook.
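A sketch of such a function; the codebook columns `year`, `old_name`, and `new_name` are assumptions:

```r
library(tidyverse)

# Read one wave's file, keeping and renaming the variables listed for that year
read_wave <- function(file, yr) {
  cb <- codebook %>% filter(year == yr)  # this wave's rows of the codebook
  read_csv(file) %>%
    select(all_of(cb$old_name)) %>%      # pull the correct variables
    set_names(cb$new_name)               # give them consistent names across waves
}

df <- tibble(file = list.files(wave_path, full.names = TRUE)) %>%
  mutate(year = as.numeric(str_extract(file, "[0-9]{4}")),
         data = map2(file, year, read_wave)) %>%  # two inputs, so map2()
  select(-file) %>%
  unnest(data)
```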

Multiple Waves, Multiple Files for Same Variables (Example 5)

In other cases, we may have multiple types of files for different waves. Across waves, those variables may be the same or different, but we’ll focus on the case when we largely want the same variables.
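One way to sketch this is to iterate over every wave-by-file-type combination and then merge the file types within each wave. The wave numbers, file-type labels, naming scheme, and join key `SID` below are all assumptions:

```r
library(tidyverse)

df <- crossing(wave = 1:7, type = c("personality", "health")) %>%  # hypothetical types
  mutate(file = sprintf("w%d_%s.csv", wave, type),
         data = map(file, ~read_csv(file.path(wave_path, .x)))) %>%
  group_by(wave) %>%
  # merge the different file types within each wave by participant ID
  summarize(data = list(reduce(data, full_join, by = "SID")), .groups = "drop") %>%
  unnest(data)
```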

Running Models

Another really powerful feature of purrr is keeping your data, models, tables, plots, etc. all conveniently indexed together. Often we need to do this for multiple DVs or predictors, and you may end up with an environment that looks something like E_fit1, A_fit1, E_fit2, A_fit2, and so on. There’s nothing wrong with this. But eventually you’ll want to pull out coefficients, plot results, etc., and it’s easy to make a copy-and-paste error or name different types of objects inconsistently, which can be difficult for both future you and anyone else using your code.

Before we can learn how to use purrr for this, we need to understand what a nested data frame is. If you’ve ever worked with a list in R, you are halfway there. Basically, a nested data frame takes the normal data frame you are probably familiar with and adds some new features. It still has columns, rows, and cells, but what makes up those cells isn’t restricted to numbers, strings, or logicals. Instead, you can put essentially anything you want in them: lists, models, data frames, plots, etc.!

If this sounds scary, it will hopefully become clearer if we use our read in data from above to run, table, and plot some basic longitudinal models of our data.

Read in Data

Clean Data

Now the data are all loaded in and have been given informative variable names, but we still need to do some data cleaning for the personality data.

We’ll start by selecting only the personality variables and reverse-scoring them. Then we’ll create composites. To do so, we’ll again use our codebook.
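A sketch of that cleaning step; the codebook columns (`name`, `trait`, `reverse`) and the 1–7 response scale are assumptions, and `df` is the merged data frame from the reading step:

```r
library(tidyverse)

df_long <- df %>%
  # long format: one row per person x year x item
  pivot_longer(cols = all_of(codebook$name), names_to = "item", values_to = "value") %>%
  left_join(codebook %>% select(item = name, trait, reverse), by = "item") %>%
  mutate(value = ifelse(reverse == -1, 8 - value, value)) %>%    # reverse-score flagged items
  group_by(Procedural__SID, year, trait) %>%
  summarize(value = mean(value, na.rm = TRUE), .groups = "drop") # trait composites
```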

Descriptives

Descriptive Statistics of Study Variables (M, SD, and N by year; E = Extraversion, A = Agreeableness, C = Conscientiousness, N = Neuroticism, O = Openness)

| Year | E M | E SD | E N | A M | A SD | A N | C M | C SD | C N | N M | N SD | N N | O M | O SD | O N |
|------|-----|------|-----|-----|------|-----|-----|------|-----|-----|------|-----|-----|------|-----|
| 2005 | 5.14 | 1.17 | 10451 | 5.78 | 1.00 | 10451 | 6.21 | 1.00 | 10451 | 4.72 | 1.23 | 10451 | 4.46 | 1.28 | 10451 |
| 2009 | 5.09 | 1.17 | 10327 | 5.66 | 1.01 | 10327 | 6.13 | 1.00 | 10327 | 4.85 | 1.23 | 10327 | 4.36 | 1.28 | 10327 |
| 2013 | 3.71 | 2.08 | 15544 | 4.04 | 2.27 | 15544 | 4.30 | 2.46 | 15544 | 5.98 | 1.64 | 15544 | 0.88 | 4.75 | 15544 |

Coding Time

It’s important to remember how we code time. There are several ways we can do it. For now, for simplicity, we will create a new wave variable where 2005 = 0, 2009 = 1, and 2013 = 2, but we could make a lot of other choices depending on our goals.

It’s going to get mad later when I run growth models if I keep people with only one wave, so we’re going to remove them now.
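A sketch of both steps, assuming the composited long data live in a data frame called `df_long` (a hypothetical name; `df6_long` is the name used in the models below):

```r
library(tidyverse)

df6_long <- df_long %>%
  mutate(wave = (year - 2005) / 4) %>%  # waves 4 years apart: 0, 1, 2
  group_by(Procedural__SID) %>%
  filter(n_distinct(wave) > 1) %>%      # growth models need at least two waves
  ungroup()
```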

Fit Unconditional Models

Now, here we could run separate unconditional growth models for each of the Big 5 like this:
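For Extraversion, say, that model is (the object name `fit0_E` is mine):

```r
library(lme4)

# Unconditional (intercept-only) model for Extraversion
fit0_E <- lmer(value ~ 1 + (1 | Procedural__SID),
               data = df6_long %>% filter(trait == "E"))
summary(fit0_E)
```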

## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + (1 | Procedural__SID)
##    Data: df6_long %>% filter(trait == "E")
## 
## REML criterion at convergence: 62339.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.8959 -0.4997  0.0144  0.5498  3.3758 
## 
## Random effects:
##  Groups          Name        Variance Std.Dev.
##  Procedural__SID (Intercept) 0.7865   0.8868  
##  Residual                    0.5308   0.7286  
## Number of obs: 22187, groups:  Procedural__SID, 8592
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  5.11572    0.01078   474.6

But this would be tedious and prone to error. So instead we will use list columns to do it. We’ll start by using the group_by() and nest() functions from dplyr and tidyr to put the data for each trait into a cell of our data frame:
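A sketch, storing the nested data in a data frame I’ll call `df6_nested` (a hypothetical name):

```r
library(tidyverse)

df6_nested <- df6_long %>%
  group_by(trait) %>%  # one row per Big 5 trait
  nest() %>%           # each trait's data becomes a cell in a list column
  ungroup()
```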

Now, our data frame is 5 x 2, with the elements in the second column each containing the data frame that corresponds to that trait. This makes it really easy to run our models using the map() family of functions from purrr.

Below, we will add a new column to our data frame that will contain the unconditional model for each trait.
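A sketch, assuming the nested data frame is called `df6_nested`:

```r
library(lme4)
library(tidyverse)

df6_nested <- df6_nested %>%
  # fit the unconditional model separately to each trait's data
  mutate(fit0 = map(data, ~lmer(value ~ 1 + (1 | Procedural__SID), data = .x)))
```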

Now we can see we have a new list column in our data frame called fit0 that contains an S4 object of class lmerMod – our fitted growth model. What this model tells us is the mean across all observations as well as the between-person variability around that mean. To understand it, I personally find it easiest to plot it; we’ll go over the code for that next week.

ICC

If you remember, what we’re often interested in from the unconditional model is the ICC (the proportion of total variance that is between persons), so let’s extract that from the models using the ICC() function from the reghelper package. In this case, we will use a version of map() called map_dbl() because we want our result to be a regular numeric column, not a list column.
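A sketch, again assuming the nested data frame is called `df6_nested`:

```r
library(reghelper)
library(tidyverse)

df6_nested <- df6_nested %>%
  mutate(icc = map_dbl(fit0, ICC))  # one numeric ICC per trait
```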

Fit Growth Models

What we’re starting to see is that we can keep a tidy working environment while still holding onto a lot of info that we can access with relative ease.

But before we get to things like pulling info from our models, let’s go ahead and run our basic growth model with and without a random slope.
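A sketch of both models (the column names `fit1` and `fit2` match the comparisons below):

```r
library(lme4)
library(tidyverse)

df6_nested <- df6_nested %>%
  mutate(
    fit1 = map(data, ~lmer(value ~ 1 + wave + (1 | Procedural__SID),   data = .x)),  # fixed slope only
    fit2 = map(data, ~lmer(value ~ 1 + wave + (wave | Procedural__SID), data = .x))  # random slope
  )
```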

Our data frame has expanded to have two more columns.

Let’s visualize the difference between these two models.
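One sketch of that comparison uses augment() from broom.mixed to pull fitted values from each model and plots one faint predicted trajectory per person:

```r
library(broom.mixed)
library(tidyverse)

plot_df <- df6_nested %>%
  mutate(`Fixed slope`  = map(fit1, augment),
         `Random slope` = map(fit2, augment)) %>%
  select(trait, `Fixed slope`, `Random slope`) %>%
  pivot_longer(-trait, names_to = "model", values_to = "aug") %>%
  unnest(aug)

ggplot(plot_df, aes(wave, .fitted, group = Procedural__SID)) +
  geom_line(alpha = .05) +       # one line per person
  facet_grid(model ~ trait) +
  theme_classic()
```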

Model Comparisons

To decide if we should have a random slope, we typically do nested model comparisons. We can do that here, too. Here, we need to use both fit1 and fit2, so we’ll use the map2() function from purrr to take 2 inputs and use the anova() function to compare them.
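A sketch, storing the comparisons in a list column I’ll call `comp` (a hypothetical name):

```r
library(tidyverse)

df6_nested <- df6_nested %>%
  mutate(comp = map2(fit1, fit2, anova))  # likelihood ratio test per trait
```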

To see the results, we can do the following:
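Assuming the comparisons were stored in a list column called `comp`, printing that column shows one anova table per trait:

```r
df6_nested$comp
```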

## [[1]]
## Data: .
## Models:
## .x[[i]]: value ~ 1 + wave + (1 | Procedural__SID)
## .y[[i]]: value ~ 1 + wave + (wave | Procedural__SID)
##         Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
## .x[[i]]  4 58533 58565 -29263    58525                         
## .y[[i]]  6 58535 58583 -29261    58523 2.8338      2     0.2425
## 
## [[2]]
## Data: .
## Models:
## .x[[i]]: value ~ 1 + wave + (1 | Procedural__SID)
## .y[[i]]: value ~ 1 + wave + (wave | Procedural__SID)
##         Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)   
## .x[[i]]  4 57394 57426 -28693    57386                            
## .y[[i]]  6 57389 57437 -28688    57377 9.3688      2   0.009238 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [[3]]
## Data: .
## Models:
## .x[[i]]: value ~ 1 + wave + (1 | Procedural__SID)
## .y[[i]]: value ~ 1 + wave + (wave | Procedural__SID)
##         Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)   
## .x[[i]]  4 62305 62337 -31149    62297                            
## .y[[i]]  6 62298 62346 -31143    62286 11.531      2   0.003134 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [[4]]
## Data: .
## Models:
## .x[[i]]: value ~ 1 + wave + (1 | Procedural__SID)
## .y[[i]]: value ~ 1 + wave + (wave | Procedural__SID)
##         Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## .x[[i]]  4 66171 66203 -33081    66163                             
## .y[[i]]  6 66146 66194 -33067    66134 28.585      2  6.207e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [[5]]
## Data: .
## Models:
## .x[[i]]: value ~ 1 + wave + (1 | Procedural__SID)
## .y[[i]]: value ~ 1 + wave + (wave | Procedural__SID)
##         Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## .x[[i]]  4 67780 67812 -33886    67772                             
## .y[[i]]  6 67766 67814 -33877    67754 18.507      2  9.579e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tabling Values

Looks like we have enough slope variance for all traits but Agreeableness to proceed with the random slope models. We’re going to proceed with the random slope models for all traits for consistency.

The next thing we want to do is actually examine the model coefficients. To do that, I prefer to use the tidy() function from the broom.mixed package.
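A sketch, tidying the random-slope models and keeping confidence intervals:

```r
library(broom.mixed)
library(tidyverse)

df6_nested <- df6_nested %>%
  mutate(tidy = map(fit2, ~tidy(.x, conf.int = TRUE)))  # coefficient table per trait
```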

Now we have a new column called tidy that contains a data frame. But we want to be able to see those values. This is where purrr will really shine once again, especially when coupled with the unnest() from tidyr. Watch:
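A sketch of the unnesting step:

```r
library(tidyverse)

df6_nested %>%
  select(trait, tidy) %>%
  unnest(tidy)  # one long table of terms across all five traits
```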

This is fine, but kind of ugly (I can’t publish this table, and it should be clear right now that I do not like copying and pasting).

The code below is going to clean this up a bit. See if you can figure out what’s going on:
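One way to sketch that cleanup, rounding estimates and formatting the confidence intervals (the object name `tab` is mine; random-effect terms get no CI here because tidy() returns them without one by default):

```r
library(tidyverse)

tab <- df6_nested %>%
  select(trait, tidy) %>%
  unnest(tidy) %>%
  mutate(b  = sprintf("%.2f", estimate),
         CI = ifelse(effect == "fixed",
                     sprintf("[%.2f, %.2f]", conf.low, conf.high), "")) %>%
  select(trait, effect, term, b, CI)
```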

We extracted some elements, but we still aren’t quite ready for publication. Let’s do some reshaping so that we have different rows for terms and different columns for traits.
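A sketch of the reshape, assuming the cleaned table from the previous step is called `tab`:

```r
library(tidyverse)

tab_wide <- tab %>%
  pivot_wider(names_from = trait, values_from = c(b, CI))  # one column pair per trait
```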

Now we have it formatted, but let’s make it pretty using the kable() function from the knitr package and the kableExtra package.
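A sketch of the kable step, assuming the wide table is called `tab_wide` (the b and CI columns may need reordering so each trait’s pair sits together under its header):

```r
library(knitr)
library(kableExtra)

tab_wide %>%
  kable(booktabs = TRUE, caption = "Growth Model Terms for the Big 5") %>%
  kable_styling(full_width = FALSE) %>%
  add_header_above(c(" " = 2, "Extraversion" = 2, "Agreeableness" = 2,
                     "Conscientiousness" = 2, "Neuroticism" = 2, "Openness" = 2))
```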

Growth Model Terms for the Big 5 (b and 95% CI; E = Extraversion, A = Agreeableness, C = Conscientiousness, N = Neuroticism, O = Openness)

| Effect | Term | E b | E CI | A b | A CI | C b | C CI | N b | N CI | O b | O CI |
|--------|------|-----|------|-----|------|-----|------|-----|------|-----|------|
| Fixed | Intercept | 5.15 | [5.13, 5.18] | 5.77 | [5.75, 5.79] | 6.22 | [6.20, 6.24] | 4.74 | [4.72, 4.77] | 4.46 | [4.44, 4.49] |
| Fixed | Slope | -0.04 | [-0.05, -0.03] | -0.06 | [-0.07, -0.04] | -0.05 | [-0.06, -0.03] | 0.08 | [0.07, 0.10] | -0.04 | [-0.06, -0.03] |
| Random | Intercept-Slope Correlation | -0.20 | | -0.12 | | -0.19 | | -0.28 | | -0.16 | |
| Random | SD Intercept | 0.91 | | 0.70 | | 0.70 | | 0.94 | | 0.95 | |
| Random | SD Residual | 0.71 | | 0.71 | | 0.68 | | 0.79 | | 0.83 | |
| Random | SD Slope | 0.16 | | 0.11 | | 0.14 | | 0.22 | | 0.20 | |